In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import fmin

np.random.seed(1789)

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[1]:

Regression modeling

A primary goal of many statistical data analyses is to characterize the influence of one variable on another.

For example:


In [2]:
from io import StringIO

data_string = """
Drugs	Score
0	1.17	78.93
1	2.97	58.20
2	3.26	67.47
3	4.69	37.47
4	5.83	45.65
5	6.00	32.92
6	6.41	29.97
"""

lsd_and_math = pd.read_table(StringIO(data_string), sep='\t', index_col=0)
lsd_and_math


Out[2]:
Drugs Score
0 1.17 78.93
1 2.97 58.20
2 3.26 67.47
3 4.69 37.47
4 5.83 45.65
5 6.00 32.92
6 6.41 29.97

Taking LSD was a profound experience, one of the most important things in my life --Steve Jobs


In [3]:
lsd_and_math.plot(x='Drugs', y='Score', style='ro', legend=False, xlim=(0,8))


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f93db00>

We can build a model to characterize the relationship between $X$ and $Y$, recognizing that factors other than $X$ (the variables we have measured or are interested in) may also influence the response variable $Y$.

  • $M(Y|X) = E(Y|X)$
  • $M(Y|X) = Pr(Y=1|X)$

In general,

$$M(Y|X) = f(X)$$

and for linear regression,

$$M(Y|X) = f(X\beta)$$

where $f$ is some function, for example a linear function:

$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \epsilon_i$

Regression is a weighted sum of independent predictors

Here, $\epsilon_i$ accounts for the difference between the observed response $y_i$ and its prediction from the model, $\hat{y}_i = \beta_0 + \beta_1 x_i$. This is sometimes referred to as process uncertainty.

Interpretation: each coefficient represents the expected change in $Y$ for a one-unit increase in the corresponding predictor $X$.

Two important regression assumptions:

  1. normal errors
  2. homoscedasticity

Parameter estimation

We would like to select $\beta_0, \beta_1$ so that the difference between the predictions and the observations is zero, but this is not usually possible. Instead, we choose a reasonable criterion: the smallest sum of the squared differences between $\hat{y}$ and $y$.

$$\text{RSS} = \sum_i (y_i - [\beta_0 + \beta_1 x_i])^2 = \sum_i \epsilon_i^2 $$

Squaring serves two purposes:

  1. to prevent positive and negative values from cancelling each other out
  2. to strongly penalize large deviations.

Whether or not the latter is desirable depends on the goals of the analysis.

In other words, we will select the parameters that minimize the squared error of the model.


In [4]:
sum_of_squares = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x) ** 2)

In [5]:
sum_of_squares([0,1], lsd_and_math.Drugs, lsd_and_math.Score)


Out[5]:
17159.8154

In [6]:
x, y = lsd_and_math.T.values
b0, b1 = fmin(sum_of_squares, [0,1], args=(x,y))
b0, b1


Optimization terminated successfully.
         Current function value: 253.881329
         Iterations: 97
         Function evaluations: 179
Out[6]:
(89.123909209804239, -9.0094696583309499)
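
As a quick check of the interpretation above, the fitted slope is the change in predicted score per unit increase in tissue concentration. A minimal sketch, using the b0 and b1 values returned by the fit above:

# Predictions at two drug concentrations one unit apart;
# their difference equals the fitted slope b1 (roughly -9 points per unit)
pred_at_3 = b0 + b1*3
pred_at_4 = b0 + b1*4
print(pred_at_4 - pred_at_3)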

In [7]:
ax = lsd_and_math.plot(x='Drugs', y='Score', style='ro', legend=False, xlim=(0,8))
ax.plot([0,10], [b0, b0+b1*10])


Out[7]:
[<matplotlib.lines.Line2D at 0x10fab9da0>]

In [8]:
ax = lsd_and_math.plot(x='Drugs', y='Score', style='ro', legend=False, xlim=(0,8), ylim=(20, 90))
ax.plot([0,10], [b0, b0+b1*10])
for xi, yi in zip(x,y):
    ax.plot([xi]*2, [yi, b0+b1*xi], 'k:')
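
The two regression assumptions noted earlier (normal errors and homoscedasticity) can be checked informally by examining the residuals. A minimal sketch, assuming the x, y, b0, and b1 defined above:

# Residuals and fitted values from the least squares line
fitted = b0 + b1*x
residuals = y - fitted

# Residuals vs. fitted values: look for roughly constant spread (homoscedasticity)
plt.figure()
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel('fitted value')
plt.ylabel('residual')

# Histogram of residuals: look for an approximately normal shape
# (with only 7 observations, these checks are necessarily crude)
plt.figure()
plt.hist(residuals)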


Alternative loss functions

Minimizing the sum of squares is not the only criterion we can use; it is just a very popular (and successful) one. For example, we can try to minimize the sum of absolute differences:


In [9]:
sum_of_absval = lambda theta, x, y: np.sum(np.abs(y - theta[0] - theta[1]*x))

In [10]:
b0, b1 = fmin(sum_of_absval, [0,0], args=(x,y))
print('\nintercept: {0:.2}, slope: {1:.2}'.format(b0,b1))
ax = lsd_and_math.plot(x='Drugs', y='Score', style='ro', legend=False, xlim=(0,8))
ax.plot([0,10], [b0, b0+b1*10])


Optimization terminated successfully.
         Current function value: 31.692066
         Iterations: 137
         Function evaluations: 264

intercept: 9e+01, slope: -9.3
Out[10]:
[<matplotlib.lines.Line2D at 0x10fc9e8d0>]

We are not restricted to a straight-line regression model; we can represent a curved relationship between our variables by introducing polynomial terms. For example, a quadratic model:

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$

In [11]:
sum_squares_quad = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2)) ** 2)

In [12]:
b0,b1,b2 = fmin(sum_squares_quad, [1,1,-1], args=(x,y))
print('\nintercept: {0:.2}, x: {1:.2}, x2: {2:.2}'.format(b0,b1,b2))
ax = lsd_and_math.plot(x='Drugs', y='Score', style='ro', legend=False, xlim=(0,8))
xvals = np.linspace(0, 8, 100)
ax.plot(xvals, b0 + b1*xvals + b2*(xvals**2))


Optimization terminated successfully.
         Current function value: 251.093792
         Iterations: 177
         Function evaluations: 319

intercept: 9.2e+01, x: -1.1e+01, x2: 0.24
Out[12]:
[<matplotlib.lines.Line2D at 0x10fce6860>]

Although a polynomial model characterizes a nonlinear relationship, it is a linear problem in terms of estimation. That is, the regression model $f(y | x)$ is linear in the parameters.

For some data, it may be reasonable to consider polynomials of order > 2. For example, consider the relationship between the number of spawning salmon and the number of juveniles recruited into the population the following year; one would expect the relationship to be positive, but not necessarily linear.


In [13]:
sum_squares_cubic = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2) 
                                  - theta[3]*(x**3)) ** 2)

In [14]:
salmon = pd.read_table("../data/salmon.dat", delim_whitespace=True, index_col=0)
plt.plot(salmon.spawners, salmon.recruits, 'r.')
b0,b1,b2,b3 = fmin(sum_squares_cubic, [0,1,-1,0], args=(salmon.spawners, salmon.recruits))
xvals = np.arange(500)
plt.plot(xvals, b0 + b1*xvals + b2*(xvals**2) + b3*(xvals**3))


Optimization terminated successfully.
         Current function value: 12809.933906
         Iterations: 251
         Function evaluations: 437
Out[14]:
[<matplotlib.lines.Line2D at 0x10fe55b38>]

Linear Regression with scikit-learn

In practice, we need not fit least squares models by hand because they are implemented in packages such as scikit-learn and statsmodels. For example, the scikit-learn package implements least squares models in its LinearRegression class:


In [15]:
from sklearn import linear_model

straight_line = linear_model.LinearRegression()
straight_line.fit(x[:, np.newaxis], y)


Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [16]:
straight_line.coef_


Out[16]:
array([-9.00946642])

In [17]:
straight_line.intercept_


Out[17]:
89.123873767993061

In [18]:
plt.plot(x, y, 'ro')
plt.plot(x, straight_line.predict(x[:, np.newaxis]), color='blue',
         linewidth=3)


Out[18]:
[<matplotlib.lines.Line2D at 0x1108231d0>]

For more general regression model building, it's helpful to use a tool for describing statistical models, called patsy. With patsy, it is easy to specify the desired combinations of variables for any particular analysis, using an "R-like" syntax. patsy parses the formula string and uses it to construct the appropriate design matrix for the model.

For example, a quadratic model for the salmon data can be coded as:


In [19]:
from patsy import dmatrix

X = dmatrix('salmon.spawners + I(salmon.spawners**2)')

The dmatrix function returns the design matrix, which can be passed directly to the LinearRegression fitting method.


In [20]:
poly_line = linear_model.LinearRegression(fit_intercept=False)
poly_line.fit(X, salmon.recruits)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

In [21]:
poly_line.coef_


Out[21]:
array([  2.13819247e+01,   9.59677283e-01,  -8.39694795e-04])

In [22]:
plt.plot(salmon.spawners, salmon.recruits, 'ro')
frye_range = np.arange(500)
plt.plot(frye_range, poly_line.predict(dmatrix('frye_range + I(frye_range**2)')), color='blue')


Out[22]:
[<matplotlib.lines.Line2D at 0x110aabf98>]

Generalized linear models

Often our data violate one or more of the linear regression assumptions:

  • non-linear
  • non-normal error distribution
  • heteroskedasticity

This forces us to generalize the regression model in order to account for these characteristics.

As a motivating example, consider the Olympic medals data that we compiled earlier in the tutorial.


In [23]:
medals = pd.read_csv('../data/medals.csv')
medals.head()


Out[23]:
medals population oecd log_population
0 1 96165 0 11.473821
1 1 281584 0 12.548186
2 6 2589043 0 14.766799
3 25 10952046 0 16.209037
4 41 18348078 1 16.725035

We expect a positive relationship between population and awarded medals, but the data in their raw form are clearly not amenable to linear regression.


In [24]:
medals.plot(x='population', y='medals', kind='scatter')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x110abf7f0>

Part of the issue is the scale of the variables. For example, countries' populations span several orders of magnitude. We can correct this by using the logarithm of population, which we have already calculated.


In [25]:
medals.plot(x='log_population', y='medals', kind='scatter')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fc9a048>

This is an improvement, but the relationship is still not adequately modeled by least-squares regression.


In [26]:
linear_medals = linear_model.LinearRegression()
X = medals.log_population[:, np.newaxis]
linear_medals.fit(X, medals.medals)


Out[26]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [27]:
ax = medals.plot(x='log_population', y='medals', kind='scatter')
ax.plot(medals.log_population, linear_medals.predict(X), color='red',
         linewidth=2)


Out[27]:
[<matplotlib.lines.Line2D at 0x110b0e208>]

This is due to the fact that the response data are counts. As a result, they tend to have characteristic properties.

  • discrete
  • positive
  • variance grows with mean

To account for this, we can do two things: (1) model the medal count on the log scale and (2) assume Poisson errors rather than normal errors.

Recall the Poisson distribution from the previous section:

$$p(y)=\frac{e^{-\lambda}\lambda^y}{y!}$$
  • $Y=\{0,1,2,\ldots\}$
  • $\lambda > 0$
$$E(Y) = \text{Var}(Y) = \lambda$$

So, we will model the logarithm of the expected value as a linear function of our predictors:

$$\log(\lambda) = X\beta$$

In this context, the log function is called a link function. This transformation implies the mean of the Poisson is:

$$\lambda = \exp(X\beta)$$

We can plug this into the Poisson likelihood and use maximum likelihood to estimate the regression covariates $\beta$.

$$\log L = \sum_{i=1}^n -\exp(X_i\beta) + Y_i (X_i \beta)- \log(Y_i!)$$

As we have already done, we just need to code the kernel of this likelihood, and optimize!


In [28]:
# Poisson negative log-likelihood (kernel only; the log(y!) term is dropped)
poisson_loglike = lambda beta, X, y: -(-np.exp(X.dot(beta)) + y*X.dot(beta)).sum()

Let's use the assign method to add a column of ones to the design matrix.


In [29]:
poisson_loglike([0,1], medals[['log_population']].assign(intercept=1), medals.medals)


Out[29]:
-627.255735551737

We will use Nelder-Mead (the algorithm behind fmin) to minimize the negative log-likelihood.


In [30]:
b1, b0 = fmin(poisson_loglike, [0,1], args=(medals[['log_population']].assign(intercept=1).values, 
                                            medals.medals.values))


Optimization terminated successfully.
         Current function value: -1381.299433
         Iterations: 68
         Function evaluations: 131

In [31]:
b0, b1


Out[31]:
(-5.2973029170604393, 0.44873025169011005)

The resulting fit looks reasonable.


In [32]:
ax = medals.plot(x='log_population', y='medals', kind='scatter')
xvals = np.arange(12, 22)
ax.plot(xvals, np.exp(b0 + b1*xvals), 'r--')


Out[32]:
[<matplotlib.lines.Line2D at 0x110ba9fd0>]

Exercise: Multivariate GLM

Add the OECD indicator variable to the model, and estimate the model coefficients.


In [33]:
# Write your answer here

Interactions among variables

Interactions imply that the effect of one covariate $X_1$ on $Y$ depends on the value of another covariate $X_2$.

$$M(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 +\beta_3 X_1 X_2$$

the effect of a unit increase in $X_1$:

$$\begin{aligned} M(Y|X_1+1, X_2) - M(Y|X_1, X_2) &= \beta_0 + \beta_1 (X_1 + 1) + \beta_2 X_2 + \beta_3 (X_1 + 1) X_2 - [\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2] \\ &= \beta_1 + \beta_3 X_2 \end{aligned}$$

In [34]:
ax = medals[medals.oecd==1].plot(x='log_population', y='medals', kind='scatter', alpha=0.8)
medals[medals.oecd==0].plot(x='log_population', y='medals', kind='scatter', color='red', alpha=0.5, ax=ax)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x110c9e160>

Interaction can be interpreted as:

  • $X_1$ interacts with $X_2$
  • $X_1$ modifies the effect of $X_2$
  • $X_2$ modifies the effect of $X_1$
  • $X_1$ and $X_2$ are non-additive or synergistic

Let's construct a model that predicts medal count based on population size and OECD status, as well as the interaction. We can use patsy to set up the design matrix.


In [35]:
y = medals.medals
X = dmatrix('log_population * oecd', data=medals)
X


Out[35]:
DesignMatrix with shape (79, 4)
  Intercept  log_population  oecd  log_population:oecd
          1        11.47382     0              0.00000
          1        12.54819     0              0.00000
          1        14.76680     0              0.00000
          1        16.20904     0              0.00000
          1        16.72504     1             16.72504
          1        16.14509     1             16.14509
          1        15.91733     0              0.00000
          1        13.99525     0              0.00000
          1        15.10232     1             15.10232
          1        15.29285     1             15.29285
          1        16.15816     0              0.00000
          1        16.55847     1             16.55847
          1        14.31108     0              0.00000
          1        15.47603     1             15.47603
          1        15.10551     1             15.10551
          1        16.14891     1             16.14891
          1        14.46491     0              0.00000
          1        15.78927     1             15.78927
          1        15.99697     1             15.99697
          1        16.93468     0              0.00000
          1        18.22083     1             18.22083
          1        15.44896     1             15.44896
          1        16.16795     1             16.16795
          1        17.21615     1             17.21615
          1        16.01637     0              0.00000
          1        16.64175     0              0.00000
          1        17.88266     1             17.88266
          1        17.86445     1             17.86445
          1        17.63646     0              0.00000
          1        16.13364     1             16.13364
  [49 rows omitted]
  Terms:
    'Intercept' (column 0)
    'log_population' (column 1)
    'oecd' (column 2)
    'log_population:oecd' (column 3)
  (to view full data, use np.asarray(this_obj))

Now, fit the model.


In [36]:
interaction_params = fmin(poisson_loglike, [0,1,1,0], args=(X, y))


Warning: Maximum number of function evaluations has been exceeded.

In [37]:
interaction_params


Out[37]:
array([-5.2958393 ,  0.42441607, -2.16727213,  0.17978487])

Notice anything odd about these estimates?

The main effect of OECD is negative, which seems counter-intuitive. This is because that coefficient is interpreted as the OECD effect when log-population is zero, which is not particularly meaningful.

We can improve the interpretability of this parameter by centering the log-population variable prior to entering it into the model. This will result in the OECD main effect being interpreted as the marginal effect of being an OECD country for an average-sized country.


In [38]:
y = medals.medals
X = dmatrix('center(log_population) * oecd', data=medals)
X


Out[38]:
DesignMatrix with shape (79, 4)
  Intercept  center(log_population)  oecd  center(log_population):oecd
          1                -5.03303     0                     -0.00000
          1                -3.95866     0                     -0.00000
          1                -1.74005     0                     -0.00000
          1                -0.29781     0                     -0.00000
          1                 0.21819     1                      0.21819
          1                -0.36176     1                     -0.36176
          1                -0.58952     0                     -0.00000
          1                -2.51159     0                     -0.00000
          1                -1.40453     1                     -1.40453
          1                -1.21400     1                     -1.21400
          1                -0.34869     0                     -0.00000
          1                 0.05163     1                      0.05163
          1                -2.19577     0                     -0.00000
          1                -1.03081     1                     -1.03081
          1                -1.40133     1                     -1.40133
          1                -0.35793     1                     -0.35793
          1                -2.04194     0                     -0.00000
          1                -0.71758     1                     -0.71758
          1                -0.50988     1                     -0.50988
          1                 0.42783     0                      0.00000
          1                 1.71398     1                      1.71398
          1                -1.05789     1                     -1.05789
          1                -0.33889     1                     -0.33889
          1                 0.70930     1                      0.70930
          1                -0.49047     0                     -0.00000
          1                 0.13490     0                      0.00000
          1                 1.37582     1                      1.37582
          1                 1.35760     1                      1.35760
          1                 1.12961     0                      0.00000
          1                -0.37321     1                     -0.37321
  [49 rows omitted]
  Terms:
    'Intercept' (column 0)
    'center(log_population)' (column 1)
    'oecd' (column 2)
    'center(log_population):oecd' (column 3)
  (to view full data, use np.asarray(this_obj))

In [39]:
fmin(poisson_loglike, [0,1,1,0], args=(X, y))


Optimization terminated successfully.
         Current function value: -1492.092233
         Iterations: 303
         Function evaluations: 513
Out[39]:
array([ 1.71003145,  0.42443035,  0.80033982,  0.17976563])
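
Following the derivation in the interactions section, the effect of a one-unit increase in (centered) log-population is $\beta_1$ for non-OECD countries and $\beta_1 + \beta_3$ for OECD countries. A minimal sketch that unpacks the fitted coefficients (the variable names here are just for illustration):

# Re-fit the centered interaction model and unpack its coefficients
b0_c, b1_c, b2_c, b3_c = fmin(poisson_loglike, [0,1,1,0], args=(X, y))

# Marginal effect (on the log scale) of a one-unit increase in centered log-population
print('non-OECD slope:', b1_c)       # beta_1
print('OECD slope:', b1_c + b3_c)    # beta_1 + beta_3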

Model Selection

How do we choose among competing models for a given dataset? More parameters are not necessarily better, from the standpoint of model fit. For example, fitting a 6th order polynomial to the LSD example certainly results in an overfit.


In [40]:
def calc_poly(params, data):
    # Evaluate a polynomial with the given coefficients at each data point
    x = np.c_[[data**i for i in range(len(params))]]
    return np.dot(params, x)

x, y = lsd_and_math.T.values

sum_squares_poly = lambda theta, x, y: np.sum((y - calc_poly(theta, x)) ** 2)
betas = fmin(sum_squares_poly, np.zeros(7), args=(x,y), maxiter=1e6)
plt.plot(x, y, 'ro')
xvals = np.linspace(0, max(x), 100)
plt.plot(xvals, calc_poly(betas, xvals))


Optimization terminated successfully.
         Current function value: 304.786763
         Iterations: 830
         Function evaluations: 1284
Out[40]:
[<matplotlib.lines.Line2D at 0x110d669e8>]

One approach is to use an information-theoretic criterion to select the most appropriate model. For example, Akaike's Information Criterion (AIC) balances the fit of the model (in terms of the likelihood) with the number of parameters required to achieve that fit. We can easily calculate AIC as:

$$AIC = n \log(\hat{\sigma}^2) + 2p$$

where $p$ is the number of parameters in the model and $\hat{\sigma}^2 = RSS/(n-p-1)$.

Notice that as the number of parameters increases, the residual sum of squares goes down, but the second term (a penalty) increases.

AIC is a metric of information distance between a given model and a notional "true" model. Since we don't know the true model, the AIC value itself is not meaningful in an absolute sense, but is useful as a relative measure of model quality.

To apply AIC to model selection, we choose the model that has the lowest AIC value.


In [41]:
n = len(x)

aic = lambda rss, p, n: n * np.log(rss/(n-p-1)) + 2*p

RSS1 = sum_of_squares(fmin(sum_of_squares, [0,1], args=(x,y)), x, y)
RSS2 = sum_squares_quad(fmin(sum_squares_quad, [1,1,-1], args=(x,y)), x, y)

print('\nModel 1: {0}\nModel 2: {1}'.format(aic(RSS1, 2, n), aic(RSS2, 3, n)))


Optimization terminated successfully.
         Current function value: 253.881329
         Iterations: 97
         Function evaluations: 179
Optimization terminated successfully.
         Current function value: 251.093792
         Iterations: 177
         Function evaluations: 319

Model 1: 33.05400811127588
Model 2: 36.99049978705

Hence, on the basis of "information distance", we would select the 2-parameter (linear) model.

Exercise: Olympic medals model selection

Use AIC to select the best model from the following set of Olympic medal prediction models:

  • population only
  • population and OECD
  • interaction model

For these models, use the alternative form of AIC, which uses the log-likelihood rather than the residual sum of squares:

$$AIC = -2 \log(L) + 2p$$
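
As a starting point, this form of AIC is easy to code from a log-likelihood value. A minimal sketch; note that poisson_loglike above returns the negative log-likelihood (with the constant log(y!) term dropped, which cancels when comparing models fit to the same data):

# AIC from a (negative) log-likelihood: -2*logL + 2p
aic_loglike = lambda neg_loglike, p: 2*neg_loglike + 2*p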

In [42]:
# Write your answer here

Logistic Regression

Fitting a line to the relationship between two variables using the least squares approach is sensible when the variable we are trying to predict is continuous, but what about when the data are dichotomous?

  • male/female
  • pass/fail
  • died/survived

Let's consider the problem of predicting survival in the Titanic disaster, based on our available information. For example, let's say that we want to predict survival as a function of the fare paid for the journey.


In [43]:
titanic = pd.read_excel("../data/titanic.xls", "titanic")
titanic.name


Out[43]:
0                           Allen, Miss. Elisabeth Walton
1                          Allison, Master. Hudson Trevor
2                            Allison, Miss. Helen Loraine
3                    Allison, Mr. Hudson Joshua Creighton
4         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5                                     Anderson, Mr. Harry
6                       Andrews, Miss. Kornelia Theodosia
7                                  Andrews, Mr. Thomas Jr
8           Appleton, Mrs. Edward Dale (Charlotte Lamson)
9                                 Artagaveytia, Mr. Ramon
10                                 Astor, Col. John Jacob
11      Astor, Mrs. John Jacob (Madeleine Talmadge Force)
12                          Aubart, Mme. Leontine Pauline
13                           Barber, Miss. Ellen "Nellie"
14                   Barkworth, Mr. Algernon Henry Wilson
15                                    Baumann, Mr. John D
16                               Baxter, Mr. Quigg Edmond
17        Baxter, Mrs. James (Helene DeLaudeniere Chaput)
18                                  Bazzani, Miss. Albina
19                                   Beattie, Mr. Thomson
20                          Beckwith, Mr. Richard Leonard
21       Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
22                                  Behr, Mr. Karl Howell
23                                  Bidois, Miss. Rosalie
24                                      Bird, Miss. Ellen
25                                    Birnbaum, Mr. Jakob
26                                Bishop, Mr. Dickinson H
27                Bishop, Mrs. Dickinson H (Helen Walton)
28                                 Bissette, Miss. Amelia
29              Bjornstrom-Steffansson, Mr. Mauritz Hakan
                              ...                        
1279                 Vestrom, Miss. Hulda Amanda Adolfina
1280                                      Vovk, Mr. Janko
1281                                 Waelens, Mr. Achille
1282                                  Ware, Mr. Frederick
1283                          Warren, Mr. Charles William
1284                                    Webber, Mr. James
1285                                  Wenzel, Mr. Linhart
1286      Whabee, Mrs. George Joseph (Shawneene Abi-Saab)
1287                     Widegren, Mr. Carl/Charles Peter
1288                            Wiklund, Mr. Jakob Alfred
1289                              Wiklund, Mr. Karl Johan
1290                     Wilkes, Mrs. James (Ellen Needs)
1291                     Willer, Mr. Aaron ("Abi Weller")
1292                                   Willey, Mr. Edward
1293                    Williams, Mr. Howard Hugh "Harry"
1294                                 Williams, Mr. Leslie
1295                                  Windelov, Mr. Einar
1296                                     Wirz, Mr. Albert
1297                               Wiseman, Mr. Phillippe
1298                            Wittevrongel, Mr. Camille
1299                                  Yasbeck, Mr. Antoni
1300              Yasbeck, Mrs. Antoni (Selini Alexander)
1301                                 Youseff, Mr. Gerious
1302                                    Yousif, Mr. Wazli
1303                                Yousseff, Mr. Gerious
1304                                 Zabour, Miss. Hileni
1305                                Zabour, Miss. Thamine
1306                            Zakarian, Mr. Mapriededer
1307                                  Zakarian, Mr. Ortin
1308                                   Zimmerman, Mr. Leo
Name: name, dtype: object

In [44]:
jitter = np.random.normal(scale=0.02, size=len(titanic))

plt.scatter(np.log(titanic.fare), titanic.survived + jitter, alpha=0.3)
plt.yticks([0,1])
plt.ylabel("survived")
plt.xlabel("log(fare)")


Out[44]:
<matplotlib.text.Text at 0x1110c3eb8>

I have added random jitter on the y-axis to help visualize the density of the points, and have plotted fare on the log scale.

Clearly, fitting a line through this data makes little sense, for several reasons. First, for most values of the predictor variable, the line would predict values that are not zero or one. Second, it would seem odd to choose least squares (or similar) as a criterion for selecting the best line.


In [45]:
x = np.log(titanic.fare[titanic.fare>0])
y = titanic.survived[titanic.fare>0]
betas_titanic = fmin(sum_of_squares, [1,1], args=(x,y))


Optimization terminated successfully.
         Current function value: 277.621917
         Iterations: 55
         Function evaluations: 103

In [46]:
jitter = np.random.normal(scale=0.02, size=len(titanic))
plt.scatter(np.log(titanic.fare), titanic.survived + jitter, alpha=0.3)
plt.yticks([0,1])
plt.ylabel("survived")
plt.xlabel("log(fare)")
plt.plot([0,7], [betas_titanic[0], betas_titanic[0] + betas_titanic[1]*7.])


Out[46]:
[<matplotlib.lines.Line2D at 0x111014470>]

If we look at this data, we can see that for most values of fare, there are some individuals that survived and some that did not. However, notice that the cloud of points is denser on the "survived" (y=1) side for larger values of fare than on the "died" (y=0) side.

Stochastic model

Rather than model the binary outcome explicitly, it makes sense instead to model the probability of death or survival with a stochastic model. Probabilities are measured on a continuous [0,1] scale, which may be more amenable to prediction using a regression line. We need to consider a different probability model for this exercise, however; let's consider the Bernoulli distribution as a generative model for our data:

$$f(y|p) = p^{y} (1-p)^{1-y}$$

where $y = \{0,1\}$ and $p \in [0,1]$. So, this model predicts whether $y$ is zero or one as a function of the probability $p$. Notice that when $y=1$, the $1-p$ term disappears, and when $y=0$, the $p$ term disappears.
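
To make this concrete, the Bernoulli mass function can be coded in one line; a minimal sketch (the name bernoulli_pmf is just for illustration):

# Returns p when y == 1 and (1 - p) when y == 0
bernoulli_pmf = lambda y, p: p**y * (1 - p)**(1 - y)

bernoulli_pmf(1, 0.7), bernoulli_pmf(0, 0.7)   # approximately (0.7, 0.3)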

So, the model we want to fit should look something like this:

$$p_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

However, since $p$ is constrained to be between zero and one, it is easy to see where a linear (or polynomial) model might predict values outside of this range. As with the Poisson regression, we can modify this model slightly by using a link function to transform the probability to have an unbounded range on a new scale. Specifically, we can use a logit transformation as our link function:

$$\text{logit}(p) = \log\left[\frac{p}{1-p}\right] = x$$

Here's a plot of $p/(1-p)$


In [47]:
logit = lambda p: np.log(p/(1.-p))
unit_interval = np.linspace(0,1)
plt.plot(unit_interval/(1-unit_interval), unit_interval)
plt.xlabel(r'$p/(1-p)$')
plt.ylabel('p');


And here's the logit function:


In [48]:
plt.plot(logit(unit_interval), unit_interval)
plt.xlabel('logit(p)')
plt.ylabel('p');


The inverse of the logit transformation is:

$$p = \frac{1}{1 + \exp(-x)}$$

In [49]:
invlogit = lambda x: 1. / (1 + np.exp(-x))

So, now our model is:

$$\text{logit}(p_i) = \beta_0 + \beta_1 x_i + \epsilon_i$$

We can fit this model using maximum likelihood. Our likelihood, again based on the Bernoulli model is:

$$L(y|p) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}$$

which, on the log scale is:

$$l(y|p) = \sum_{i=1}^n y_i \log(p_i) + (1-y_i)\log(1-p_i)$$

We can easily implement this in Python, keeping in mind that fmin minimizes, rather than maximizes functions:


In [50]:
def logistic_like(theta, x, y):
    
    p = invlogit(theta[0] + theta[1] * x)
    
    # Return negative of log-likelihood
    return -np.sum(y * np.log(p) + (1-y) * np.log(1 - p))

Remove null values from the variables (a bad idea, as we will show later) ...


In [51]:
x, y = titanic[titanic.fare.notnull()][['fare', 'survived']].values.T

... and fit the model.


In [52]:
b0, b1 = fmin(logistic_like, [0.5,0], args=(x,y))
b0, b1


Optimization terminated successfully.
         Current function value: 827.015955
         Iterations: 47
         Function evaluations: 93
Out[52]:
(-0.88238984528338194, 0.012452067664164127)

In [53]:
jitter = np.random.normal(scale=0.01, size=len(x))
plt.plot(x, y+jitter, 'r.', alpha=0.3)
plt.yticks([0,.25,.5,.75,1])
xvals = np.linspace(0, 600)
plt.plot(xvals, invlogit(b0+b1*xvals))


Out[53]:
[<matplotlib.lines.Line2D at 0x1113b8e80>]

As with our least squares model, we can easily fit logistic regression models in scikit-learn, in this case using the LogisticRegression class.


In [54]:
logistic = linear_model.LogisticRegression()
logistic.fit(x[:, np.newaxis], y)


Out[54]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [55]:
logistic.coef_


Out[55]:
array([[ 0.0123847]])

Exercise: multivariate logistic regression

Which other variables might be relevant for predicting the probability of surviving the Titanic? Generalize the model likelihood to include 2 or 3 other covariates from the dataset.


In [56]:
# Write your answer here